The five leading causes of death in the United States from 1999 to 2014 were cancer, heart disease, unintentional injury, chronic lower respiratory disease, and stroke. The dataset includes the U.S. Department of Health and Human Services (HHS) public health regions, so we can investigate the leading causes of death in each region and develop public health policies and remedies accordingly.
The dataset we use has 191,748 observations of 11 variables: year, cause of death, state, age range, HHS region, locality, number of observed deaths, population, number of expected deaths, percent of potential excess deaths, and number of potential excess deaths. The HHS region indicates one of the 10 U.S. Department of Health and Human Services public health regions.
The number of observed deaths, population, number of expected deaths, percent of potential excess deaths, and number of potential excess deaths are continuous variables; the others are categorical. In this report, we focus on potential excess deaths by locality and on how the five leading causes contribute to potential excess deaths. The percent of excess deaths is calculated based on the three states with the lowest death rates. It ranges from 0 to 85.3%, with a mean of 35.66%. Locality is divided into metropolitan and nonmetropolitan areas.
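As a rough illustration of how the percent of excess deaths relates to the observed and expected counts, the sketch below assumes that the excess is observed minus expected deaths (floored at zero) and that the percentage is the excess divided by observed deaths. Both the formula and the toy numbers are assumptions for illustration, not values taken from the dataset.

```r
# Hypothetical counts for two localities (illustrative only, not from the dataset)
observed_deaths <- c(200, 120)
expected_deaths <- c(150, 130)

# Assumed definition: excess = observed - expected, floored at 0
potentially_excess_deaths <- pmax(observed_deaths - expected_deaths, 0)

# Assumed definition: share of observed deaths that is potentially excess
percent_excess_death <- potentially_excess_deaths / observed_deaths * 100

percent_excess_death
```

Under these assumptions, a locality with fewer observed deaths than expected contributes 0% rather than a negative percentage, which is consistent with the reported minimum of 0.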
From the dataset, we find that the number of potentially excess deaths from the five leading causes was higher in rural areas than in urban areas.
We then analyzed several factors that might influence the rural-urban difference in potentially excess deaths from the five leading causes, many of which are associated with sociodemographic and ecological differences between rural and urban areas.
Through statistical analysis, our report provides an interactive and straightforward view of the potentially excess deaths from the five leading causes of death in nonmetropolitan and metropolitan areas.
The ultimate goal is to draw attention to preventing deaths in rural areas by improving healthcare services and public health programs.
library(tidyverse)  # read_csv(), dplyr verbs, separate(), and the %>% pipe
library(janitor)    # clean_names()
library(plotly)     # plot_ly(), used for the interactive plot
cod_data = read_csv("./data/NCHS_-_Potentially_Excess_Deaths_from_the_Five_Leading_Causes_of_Death.csv") %>%
  clean_names() %>%
  na.omit() %>%
  filter(!(state == "United States")) %>%
  # keep only the text before "%"; the leftover piece is discarded with a warning
  separate(., percent_potentially_excess_deaths, into = c("percent_excess_death"), sep = "%") %>%
  # mortality = observed deaths per 10,000 people
  mutate(percent_excess_death = as.numeric(percent_excess_death),
         mortality = observed_deaths / population * 10000) %>%
  select(year, age_range, cause_of_death, state, locality, observed_deaths, population, expected_deaths, potentially_excess_deaths, percent_excess_death, mortality, hhs_region)
## Warning: Too many values at 191748 locations: 1, 2, 3, 4, 5, 6, 7, 8, 9,
## 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...
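The warning most likely arises because separate() splits each value at "%" into two pieces (the number and an empty string) while only one output column is named. One possible alternative, sketched here on a made-up tibble rather than the real data, is to parse the number directly with readr::parse_number(). The tibble name `toy` and its values are hypothetical.

```r
library(dplyr)
library(readr)

# Hypothetical stand-in for the raw percentage column (values are made up)
toy <- tibble(percent_potentially_excess_deaths = c("40.3%", "0.0%", "85.3%"))

# parse_number() ignores the trailing "%" and returns a double
toy_clean <- toy %>%
  mutate(percent_excess_death = parse_number(percent_potentially_excess_deaths)) %>%
  select(percent_excess_death)

toy_clean
```

This avoids the intermediate character column and the as.numeric() conversion, at the cost of silently ignoring any other non-numeric characters in the field.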
## Columns removed:
# "state_fips_code" "benchmark"
region_cod_data = cod_data %>%
  select(state, locality, hhs_region, percent_excess_death) %>%
  group_by(state, locality, hhs_region) %>%
  summarise(mean_ped = mean(percent_excess_death)) %>%
  dplyr::filter(!(state == "District of\nColumbia")) %>%
  mutate(hhs_region = as.character(hhs_region))
The year variable covers data collected from 2005 to 2015. After reviewing the dataset, we filtered out entries with United States in the state variable. Then we removed the percent sign from the Percent Potentially Excess Deaths variable and converted it to a number. We also created a variable called mortality, calculated as observed_deaths/population * 10000. This variable gives the number of deaths observed per 10,000 people in each of the three locality categories: Metropolitan, Nonmetropolitan, and All. Lastly, for the regional summary, we filtered out entries with District of\nColumbia in the state variable.
After selecting the variables of interest, the data are ready for analysis. When we analyze the relationships between the outcomes (mortality, percent excess death, etc.) and several covariates (year, cause of death, population, etc.), we group the data by the covariates and use the mean outcome as the response.
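The group-and-average step described above can be sketched as follows. This is a minimal example on a made-up tibble shaped like cod_data; the object names `toy` and `mean_by_locality` and all values are hypothetical.

```r
library(dplyr)

# Hypothetical rows shaped like cod_data (values are made up)
toy <- tibble(
  year = c(2005, 2006, 2005, 2006),
  locality = c("Metropolitan", "Metropolitan", "Nonmetropolitan", "Nonmetropolitan"),
  observed_deaths = c(500, 600, 200, 225),
  population = c(1000000, 1000000, 250000, 250000)
) %>%
  # deaths per 10,000 people, as in the cleaning step
  mutate(mortality = observed_deaths / population * 10000)

# Group by the covariate of interest and use the mean outcome as the response
mean_by_locality <- toy %>%
  group_by(locality) %>%
  summarise(mean_mortality = mean(mortality))

mean_by_locality
```

Swapping `locality` for another covariate (for example `year` or `cause_of_death`) gives the corresponding summary for that comparison.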
This interactive plot examines differences in the Standardized Mortality Ratio (SMR), the ratio of observed deaths to expected deaths, across the five causes of death in each public health region. Each point is labeled with the state it belongs to, and its size reflects the corresponding population. Here we compare SMRs within the same geographic area. Differences in the distribution of the SMR among the five causes of death may be associated with the sociodemographic, cultural, behavioral, and structural factors of the specific region.
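For reference, the SMR as defined here can also be computed directly as a column. The sketch below uses a made-up tibble; the state labels and counts are hypothetical, and `toy_smr` is an illustrative name rather than an object used elsewhere in the report.

```r
library(dplyr)

# Hypothetical observed and expected death counts (values are made up)
toy_smr <- tibble(
  state = c("State A", "State B"),
  observed_deaths = c(756, 300),
  expected_deaths = c(504, 300)
) %>%
  # SMR = observed deaths / expected deaths
  mutate(smr = observed_deaths / expected_deaths)

toy_smr
```

An SMR above 1 means more deaths were observed than expected. In a scatter plot of observed against expected deaths, such points lie above the line where observed equals expected.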
cod_data %>%
  mutate(year = as.factor(year))
## # A tibble: 191,748 x 12
## year age_range cause_of_death state locality observed_deaths
## <fctr> <chr> <chr> <chr> <chr> <int>
## 1 2005 0-49 Cancer Alabama All 756
## 2 2005 0-49 Cancer Alabama Metropolitan 556
## 3 2005 0-49 Cancer Alabama Nonmetropolitan 200
## 4 2005 0-49 Cancer Alabama All 756
## 5 2005 0-49 Cancer Alabama Metropolitan 556
## 6 2005 0-49 Cancer Alabama Nonmetropolitan 200
## 7 2005 0-49 Cancer Alabama All 756
## 8 2005 0-49 Cancer Alabama Metropolitan 556
## 9 2005 0-49 Cancer Alabama Nonmetropolitan 200
## 10 2005 0-54 Cancer Alabama All 1346
## # ... with 191,738 more rows, and 6 more variables: population <int>,
## # expected_deaths <int>, potentially_excess_deaths <int>,
## # percent_excess_death <dbl>, mortality <dbl>, hhs_region <int>
p <- cod_data %>%
  plot_ly(
    x = ~expected_deaths,
    y = ~observed_deaths,
    size = ~population,
    color = ~cause_of_death,
    frame = ~hhs_region,
    text = ~state,
    hoverinfo = "text",
    type = 'scatter',
    mode = 'markers'
  ) %>%
  layout(title = "Change of Standardized Mortality Ratio in National Public Health Regions")
p